Abstract

I  describe  an  exploration  criterion  that  attempts  to  minimize  the  error  of  a  learner  by  minimizing  its 
estimated  squared  bias.  I  describe  experiments  with  locally-weighted  regression  on  two  simple  kinematics 
problems,  and  observe  that  this  “bias-only”  approach  outperforms  the  more  common  “variance-only” 
exploration  approach,  even  in  the  presence  of  noise. 
1 Introduction

In recent years, there has been an explosion of interest in
“active” machine learning systems. These are learning
systems  that  make  queries,  or  perform  experiments  to 
gather  data  that  are  expected  to  maximize  performance. 
When  compared  with  “passive”  learning  systems,  which 
accept  given,  or  randomly  drawn  data,  active  learners 
have demonstrated significant decreases in the amount
of  data  required  to  achieve  equivalent  performance.  In 
industrial  applications,  where  each  experiment  may  take 
days  to  perform  and  cost  thousands  of  dollars,  a  method 
for  optimally  selecting  these  points  would  offer  enormous 
savings  in  time  and  money. 

An active learning system will typically attempt to
select data that will minimize its predictive error. The
error of a learner can be decomposed into bias and variance
terms. Most research in selecting optimal actions
or queries has assumed that the learner is approximately
unbiased, and that to minimize learner error, variance is
the only thing to minimize (a few examples include
Fedorov [1972], MacKay [1992], Cohn [1994; 1995], Paass
[1995]). In practice, however, there are very few problems
for which we have unbiased learners. Frequently,
bias constitutes a large portion of a learner’s error; if
the learner is deterministic and the data are noise-free,
then bias is the only source of error.¹

In  this  paper  I  describe  an  algorithm  which  selects 
actions/queries designed to minimize the bias of a locally
weighted regression-based learner. Empirically,
“variance-minimizing”  strategies  which  ignore  bias  seem 
to  perform  well,  even  in  cases  where,  strictly  speaking, 
there  is  no  variance  to  minimize.  In  the  tasks  considered 
in  this  paper,  the  bias-minimizing  strategy  consistently 
outperforms  variance  minimization,  even  in  the  presence 
of  noise. 

1.1  Bias  and  variance 

Let us begin by defining $P(x, y)$ to be the unknown joint
distribution over $x$ and $y$, and $P(x)$ to be the known
marginal distribution of $x$ (commonly called the input
distribution). We denote the learner’s output on input
$x$, given training set $\mathcal{D}$, as $\hat{y}(x; \mathcal{D})$. We can then write
the expected error of the learner as

$$\int_x E\left[(\hat{y}(x;\mathcal{D}) - y(x))^2 \mid x\right] P(x)\,dx, \tag{1}$$

where $E[\cdot]$ denotes the expectation over $P$ and over training
sets $\mathcal{D}$. The expectation inside the integral may be
decomposed as follows (Geman et al., 1992):

\begin{align}
E\left[(\hat{y}(x;\mathcal{D}) - y(x))^2 \mid x\right] &= \tag{2}\\
& E\left[(y(x) - E[y \mid x])^2\right] \tag{3}\\
& + \left(E_{\mathcal{D}}[\hat{y}(x;\mathcal{D})] - E[y \mid x]\right)^2 \notag\\
& + E_{\mathcal{D}}\left[(\hat{y}(x;\mathcal{D}) - E_{\mathcal{D}}[\hat{y}(x;\mathcal{D})])^2\right] \notag
\end{align}

where $E_{\mathcal{D}}[\cdot]$ denotes the expectation over training sets.
The first term in Equation 2 is the variance of $y$ given $x$
- it is the noise in the distribution, and does not depend
on our learner or how the training data are chosen. The
second term is the learner’s squared bias, and the third is
its variance; these last two terms comprise the expected
squared error of the learner with respect to the regression
function $E[y \mid x]$.

¹ The bias term here is a statistical bias, which is distinct
from the inductive bias discussed in some machine learning
research. See Dietterich and Kong [1995] for a discussion of
the relationship between the two.
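The decomposition in Equation 2 can be checked numerically. The sketch below is only an illustration under assumed settings that are not from the paper: a hypothetical target $E[y \mid x] = x^2$, Gaussian noise, and a deliberately biased learner that predicts the mean of its training outputs. It estimates the noise, squared-bias and variance terms at one input by Monte Carlo and compares their sum with the directly measured expected squared error.

```python
import random

def decompose(x0=0.5, n_train=20, n_sets=4000, noise_sd=0.1, seed=0):
    """Monte Carlo estimate of (noise, bias^2, variance, total error) at x0.

    Assumptions (illustrative only): target f(x) = x^2, inputs uniform
    on [0, 1], and a learner that always predicts the training mean of y.
    """
    rng = random.Random(seed)
    f = lambda x: x * x                      # hypothetical E[y|x]
    preds, sq_errs = [], []
    for _ in range(n_sets):                  # one iteration per training set D
        xs = [rng.random() for _ in range(n_train)]
        ys = [f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        y_hat = sum(ys) / n_train            # constant (biased) learner
        y_obs = f(x0) + rng.gauss(0.0, noise_sd)   # fresh noisy target at x0
        preds.append(y_hat)
        sq_errs.append((y_hat - y_obs) ** 2)
    mean_pred = sum(preds) / n_sets
    noise = noise_sd ** 2                                # first term
    bias2 = (mean_pred - f(x0)) ** 2                     # second term
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_sets  # third term
    total = sum(sq_errs) / n_sets                        # left-hand side
    return noise, bias2, variance, total

noise, bias2, variance, total = decompose()
print(noise + bias2 + variance, total)
```

With these settings the squared bias is of the same order as the variance, which is the situation the paper argues is common in practice.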

Most research in active learning assumes that the second
term of Equation 2 is approximately zero, that is,
that the learner is unbiased. If this is the case, then
one may concentrate on selecting data so as to minimize
the variance of the learner. Although this “all-variance”
approach is optimal when the learner is unbiased, truly
unbiased learners are rare. Even when the learner’s
representation class is able to match the target function
exactly, bias is generally introduced by the learning
algorithm and learning parameters. From the Bayesian
perspective, a learner is only unbiased if its priors are
exactly correct.

The optimal choice of query would, of course, minimize
both bias and variance, but I leave that for future
work. For the purposes of this paper, I will only be
concerned with selecting queries that are expected to
minimize learner bias. This approach is justified in cases
where noise is believed to be only a small component
of the learner’s error. If the learner is deterministic and
there is no noise, then strictly speaking, there is no error
due to variance; all the error must be due to learner
bias. In cases with non-determinism or noise, all-bias
minimization, like all-variance minimization, becomes an
approximation of the optimal approach.

The learning model discussed in this paper is a form
of locally weighted regression (LWR) [Cleveland et al.,
1988], which has been used in difficult machine learning
tasks, notably the “robot juggler” of Schaal and Atkeson
[1994]. Previous work [Cohn et al., 1995] discussed
all-variance query selection for LWR; in the remainder of
this paper, I describe a method for performing all-bias
query selection. Section 2 describes the criterion that
must be optimized for all-bias query selection. Section 3
describes the locally weighted regression learner used in
this paper and describes how the all-bias criterion may
be computed for it. Section 4 describes the results of
experiments using this criterion on several simple domains.
Directions for future work are discussed in Section 5.

2  All-bias  query  selection 

Let us assume for the moment that we have a source of
noise-free examples $(x_i, y_i)$ and a deterministic learner
which, given input $x$, outputs estimate $\hat{y}(x)$.² Let us
also assume that we have an accurate estimate of the
bias of $\hat{y}$ which can be used to estimate the true function
$y(x) = \hat{y}(x) - \mathit{bias}(x)$. We will break these rather
strong assumptions of noise-free examples and accurate
bias estimates in Section 4, but they are useful for
deriving the theoretical approach described below.

² For clarity, I will drop the argument $x$ except where
required for disambiguation. I will also denote only the
univariate case; the results apply in higher dimensions as well.


Given the accurate bias estimate, our task is then to
force the biased estimator into the best approximation of
$y(x)$ with the fewest number of examples. This, in effect,
transforms the query selection problem into an example
filter problem similar to that studied by Plutowski and
White [1993] for neural networks. Below, I derive this
criterion for estimating the change in error at $x$ given a
new queried example at $\tilde{x}$.

Since we have (temporarily) assumed a deterministic
learner and noise-free data, the expected error in
Equation 2 simplifies to:

$$E\left[(\hat{y}(x;\mathcal{D}) - y(x))^2 \mid x, \mathcal{D}\right] = (\hat{y}(x;\mathcal{D}) - y(x))^2. \tag{4}$$

We want to select a new $\tilde{x}$ such that when we add
$(\tilde{x}, \tilde{y})$, the resulting squared bias is minimized:

$$(\hat{y}' - y)^2 = \left(\hat{y}(x; \mathcal{D} \cup (\tilde{x}, \tilde{y})) - y(x)\right)^2. \tag{5}$$

We will, for the remainder of the paper, use the prime ($'$) to
indicate estimates based on the initial training set plus
the additional example $(\tilde{x}, \tilde{y})$. To minimize Expression 5,
we need to compute how a query at $\tilde{x}$ will change the
learner’s bias at $x$. If we assume that we know the input
distribution,³ then we can integrate this change over the
entire domain (using Monte Carlo procedures) to estimate
the resulting average change, and select a $\tilde{x}$ such
that the expected squared bias is minimized. Defining
$\mathit{bias} = \hat{y} - y$ and $\Delta\hat{y} = \hat{y}' - \hat{y}$, we can write the new
squared bias as:


$$\begin{aligned}
\mathit{bias}'^2 &= (\hat{y}' - y)^2\\
&= (\hat{y} + \Delta\hat{y} - y)^2\\
&= \Delta\hat{y}^2 + 2\Delta\hat{y} \cdot \mathit{bias} + \mathit{bias}^2.
\end{aligned}\tag{6}$$

Note that since bias as defined here is independent of $\tilde{x}$,
minimizing the bias is equivalent to minimizing
$\Delta\hat{y}^2 + 2\Delta\hat{y} \cdot \mathit{bias}$.

The estimate of $\mathit{bias}'^2$ tells us how much our bias will
change for a given $\tilde{x}$. We may optimize this value over $\tilde{x}$
in one of a number of ways. In low dimensional spaces,
it is often sufficient to consider a set of “candidate” $\tilde{x}$
and select the one promising the smallest resulting error.
In higher dimensional spaces, it is often more efficient to
search for an optimal $\tilde{x}$ with a response surface technique
[Box and Draper, 1987], or hillclimb on $\partial\,\mathit{bias}'^2 / \partial\tilde{x}$.

Estimates of bias and $\Delta\hat{y}$ depend on the specific learning
model being used. In Section 3, I describe a locally
weighted regression model, and show how differentiable
estimates of bias and $\Delta\hat{y}$ may be computed for it.

³ This assumption is contrary to the assumption normally
made in some forms of learning, e.g. PAC-learning, but it is
appropriate in many domains. If, for example, we are learning
to control a robot arm, it is reasonable to assume that
we know the distribution of positions over which we are
interested in controlling it.

2.1 An aside: why not just use $\hat{y} - \mathit{bias}$?

If we have an accurate bias estimate, it is reasonable to
ask why we do not simply use the corrected $\hat{y} - \mathit{bias}$
as our predictor. Certainly, in the limit of a perfect bias
estimate, the composite prediction would have zero bias,
and we could concentrate solely on variance, as previous
work has.

The answer to this question has several parts, the first
of which is that for most learners, there are no perfect
bias estimators. Bias estimators introduce their own bias
and variance, which must be addressed in data selection.

We can define a composite learner which produces
estimate $\hat{y}_c = \hat{y} - \mathit{bias}$. Given a random training sample
then, we would expect $\hat{y}_c$ to outperform $\hat{y}$. However,
there is no obvious way to select data for this composite
learner other than selecting to maximize the performance
of its two components. In our case, the second component
(the bias estimate) is non-analytic, which leaves
us selecting data so as to maximize the performance of
the first component (the uncorrected estimator). We
are now back to our original problem: we can select
data so as to minimize either the bias or variance of
the uncorrected LWR-based learner. Since the purpose
of the correction is to give an unbiased estimator,
intuition suggests that variance minimization would be the
more sensible route in this case.

Regardless of how we select our data, we can use the
composite estimator to make our predictions; depending
on how noisy the bias estimate is, this may or may not
improve the learner’s net performance. In the domains
considered in this paper, I found that the performance of
$\hat{y}_c$ using random selection or variance minimization was
not substantially different from that of the uncorrected
$\hat{y}$ (see Figure 7 in Section 4).
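As a minimal sketch of the candidate-based selection rule of Section 2, the code below scores each candidate query by the average new squared bias of Equation 6 over a set of reference points and returns the best one. The function names and the `bias_at` / `delta_at` callables are hypothetical interfaces introduced here for illustration; the toy estimators in the usage lines are likewise assumptions, not the LWR estimates developed in Section 3.

```python
import math

def new_squared_bias(delta_y, bias):
    """Equation 6: squared bias after the query, (bias + delta_y)^2 expanded."""
    return delta_y ** 2 + 2.0 * delta_y * bias + bias ** 2

def pick_query(candidates, references, bias_at, delta_at):
    """Return the candidate with the smallest average predicted squared bias.

    bias_at(x)         -> estimated bias at reference point x
    delta_at(x, x_new) -> estimated change in prediction at x if we query x_new
    (both are caller-supplied, hypothetical estimator interfaces)
    """
    def score(x_new):
        return sum(new_squared_bias(delta_at(r, x_new), bias_at(r))
                   for r in references) / len(references)
    return min(candidates, key=score)

# Toy usage: bias grows with x, and a query cancels bias only near itself.
bias_at = lambda x: x
delta_at = lambda x, x_new: -bias_at(x) * math.exp(-(x - x_new) ** 2 / 0.02)
points = [0.1, 0.5, 0.9]
best = pick_query(points, points, bias_at, delta_at)
```

In this toy setting the rule prefers the query that removes the largest bias, which is the behavior Equation 6 is meant to capture.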


3  Locally  weighted  regression 


The type of learner I consider here is a form of locally
weighted regression (LWR) that is a slight variation on
the LOESS model of Cleveland et al. [1988]. The LOESS
model performs a linear regression on points in the data
set, weighted by a kernel centered at $x$ (see Figure 1).
The kernel shape is a design parameter: the original
LOESS model uses a “tricubic” kernel; in my experiments
I use the more common Gaussian


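$$h_i(x) = h(x - x_i) = \exp(-k(x - x_i)^2),$$

where $k$ is a smoothing parameter. For brevity, I will
drop the argument $x$ for $h_i(x)$, and define $n = \sum_i h_i$.
We can then write the weighted means and covariances
as:

$$\mu_x = \sum_i h_i \frac{x_i}{n}, \qquad \sigma_x^2 = \sum_i h_i \frac{(x_i - \mu_x)^2}{n}, \qquad \sigma_{xy} = \sum_i h_i \frac{(x_i - \mu_x)(y_i - \mu_y)}{n},$$

$$\mu_y = \sum_i h_i \frac{y_i}{n}, \qquad \sigma_y^2 = \sum_i h_i \frac{(y_i - \mu_y)^2}{n}, \qquad \sigma_{y|x}^2 = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2}.$$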


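We use these means and covariances to produce an estimate
$\hat{y}$ at the $x$ around which the kernel is centered, with
a confidence term in the form of a variance estimate:

$$\hat{y} = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x),$$

$$\sigma_{\hat{y}}^2 = \frac{\sigma_{y|x}^2}{n^2} \sum_i h_i^2 \left(1 + \frac{(x - \mu_x)(x_i - \mu_x)}{\sigma_x^2}\right)^2.$$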


In all the experiments discussed in this paper, the
smoothing parameter $k$ was set so as to minimize $\sigma_{\hat{y}}^2$.
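The LWR estimate described above can be sketched in a few lines. This is a univariate illustration, not the author's implementation: the function name `lwr_predict` is hypothetical, and the variance expression is an assumption, namely the standard variance of a weighted linear predictor under constant noise $\sigma_{y|x}^2$.

```python
import math

def lwr_predict(x, xs, ys, k):
    """Gaussian-kernel locally weighted regression at query point x.

    Returns (y_hat, var_hat). The variance form is an assumption
    (homoscedastic noise), used here only for illustration.
    """
    h = [math.exp(-k * (x - xi) ** 2) for xi in xs]      # kernel weights h_i
    n = sum(h)
    mu_x = sum(hi * xi for hi, xi in zip(h, xs)) / n
    mu_y = sum(hi * yi for hi, yi in zip(h, ys)) / n
    var_x = sum(hi * (xi - mu_x) ** 2 for hi, xi in zip(h, xs)) / n
    var_y = sum(hi * (yi - mu_y) ** 2 for hi, yi in zip(h, ys)) / n
    cov_xy = sum(hi * (xi - mu_x) * (yi - mu_y)
                 for hi, xi, yi in zip(h, xs, ys)) / n
    y_hat = mu_y + cov_xy / var_x * (x - mu_x)           # local linear fit
    var_y_given_x = var_y - cov_xy ** 2 / var_x          # residual variance
    spread = sum(hi ** 2 * (1.0 + (x - mu_x) * (xi - mu_x) / var_x) ** 2
                 for hi, xi in zip(h, xs))
    return y_hat, var_y_given_x * spread / n ** 2

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.0 * xi + 1.0 for xi in xs]       # exactly linear toy data
y_hat, var_hat = lwr_predict(0.4, xs, ys, k=10.0)
```

On exactly linear data the local linear fit reproduces the line and the residual variance vanishes, which gives a quick sanity check on the formulas.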


Figure 1: Locally weighted regression places a kernel
around the point of interest $x$. The kernel is used to assign
weightings to points in the training set, from which
$\hat{y}(x)$ is computed via linear regression.

Figure 2: Box and Draper’s method of estimating bias
measures the difference between the estimator in question
and a one-higher-order estimate. (Curves shown: best
local linear fit, best local quadratic fit.)


The low cost of incorporating new training examples
makes this form of locally weighted regression appealing
for learning systems which must operate in real time,
or with time-varying target functions (e.g. [Schaal and
Atkeson, 1994]).


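3.1 Computing $\Delta\hat{y}$ for LWR

If we know what new point $(\tilde{x}, \tilde{y})$ we’re going to add,
computing $\Delta\hat{y}$ for LWR is straightforward. Defining $\tilde{h}$
as the weight given to $\tilde{x}$, we can write

\begin{align}
\Delta\hat{y} &= \hat{y}' - \hat{y} \tag{7}\\
&= \frac{n\mu_y + \tilde{h}\tilde{y}}{n + \tilde{h}} - \mu_y - \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x)
 + \left(x - \frac{n\mu_x}{n + \tilde{h}} - \frac{\tilde{h}\tilde{x}}{n + \tilde{h}}\right)
 \cdot \frac{(n + \tilde{h})\sigma_{xy} + \tilde{h} \cdot (\tilde{x} - \mu_x)(\tilde{y} - \mu_y)}{(n + \tilde{h})\sigma_x^2 + \tilde{h} \cdot (\tilde{x} - \mu_x)^2}. \tag{8}
\end{align}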
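The closed form for $\Delta\hat{y}$ can be checked against a direct refit. The sketch below is an illustration with hypothetical helper names and toy data: it computes the change in the LWR prediction at $x$ from the current weighted statistics and the new point alone, then compares it with the difference between two full fits.

```python
import math

def weighted_stats(x, xs, ys, k):
    """Return n, mu_x, mu_y, sigma_x^2, sigma_xy for a kernel centered at x."""
    h = [math.exp(-k * (x - xi) ** 2) for xi in xs]
    n = sum(h)
    mu_x = sum(hi * xi for hi, xi in zip(h, xs)) / n
    mu_y = sum(hi * yi for hi, yi in zip(h, ys)) / n
    var_x = sum(hi * (xi - mu_x) ** 2 for hi, xi in zip(h, xs)) / n
    cov_xy = sum(hi * (xi - mu_x) * (yi - mu_y)
                 for hi, xi, yi in zip(h, xs, ys)) / n
    return n, mu_x, mu_y, var_x, cov_xy

def predict(x, xs, ys, k):
    n, mu_x, mu_y, var_x, cov_xy = weighted_stats(x, xs, ys, k)
    return mu_y + cov_xy / var_x * (x - mu_x)

def delta_y(x, xs, ys, k, x_new, y_new):
    """Change in the prediction at x after adding (x_new, y_new), no refit."""
    n, mu_x, mu_y, var_x, cov_xy = weighted_stats(x, xs, ys, k)
    h_new = math.exp(-k * (x - x_new) ** 2)          # weight of the new point
    slope = (((n + h_new) * cov_xy + h_new * (x_new - mu_x) * (y_new - mu_y))
             / ((n + h_new) * var_x + h_new * (x_new - mu_x) ** 2))
    mu_x_new = (n * mu_x + h_new * x_new) / (n + h_new)
    mu_y_new = (n * mu_y + h_new * y_new) / (n + h_new)
    y_after = mu_y_new + slope * (x - mu_x_new)
    y_before = mu_y + cov_xy / var_x * (x - mu_x)
    return y_after - y_before

xs = [0.0, 0.2, 0.5, 0.7, 1.0]
ys = [xi ** 2 for xi in xs]                          # toy target, an assumption
incremental = delta_y(0.4, xs, ys, 5.0, 0.6, 0.36)
refit = predict(0.4, xs + [0.6], ys + [0.36], 5.0) - predict(0.4, xs, ys, 5.0)
```

Because only the running statistics are needed, evaluating many candidate queries does not require refitting the model for each one.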


Note that computing $\Delta\hat{y}$ requires us to know both the
$\tilde{x}$ and $\tilde{y}$ of the new point. In practice, we only know
$\tilde{x}$. If we assume, however, that we can estimate the
learner’s bias at any $\tilde{x}$, then we can also estimate the
unknown value $\tilde{y} \approx \hat{y}(\tilde{x}) - \mathit{bias}(\tilde{x})$. Below, I consider
how to compute the bias estimate.

3.2 Estimating bias for LWR

The most common technique for estimating bias is
cross-validation. Standard cross-validation, however, only
gives estimates of the bias at our specific training points,
which are usually combined to form an average bias
estimate. This is sufficient if one assumes that the training
distribution is representative of the test distribution
(which it isn’t in query learning) and if one is content
to just estimate the bias where one already has training
data (which we can’t be).

In the query selection problem, we must be able to
estimate the bias at all possible $x$. There are several ways
we can get this estimate using LWR. Box and Draper
[1987] suggest fitting a higher order model and measuring
the difference. In the case of (linear) locally weighted
regression, one would fit a locally quadratic regressor to
the data and use the difference in estimates as the bias
(see Figure 2). Under certain conditions on the
higher-order bias terms, one can make some guarantees on the
accuracy of this bias estimate.

The disadvantages of this method stem from the fact
that it requires a higher order model. This requires
additional computation to fit, and the fit is more prone
to variance problems. For the experiments described in
this paper, this method of bias estimation yielded poor
results; two other bias-estimation techniques, however,
performed very well.

3.2.1 Estimating bias by bootstrapping residuals


Another method of estimating bias is by bootstrapping
the residuals of the training points. Based on the m
available training points, and the predictor’s fit to these
points, a “bootstrap sample” is created by randomly
drawing m values with replacement from the learner’s
residuals. These values are added to the original
predictions to create a synthetic training set on which the
learner is retrained.

By creating a number of bootstrapped predictions and
comparing their average prediction with that of the
original predictor, one arrives at a first-order bootstrap
estimate of the predictor’s bias [Connor, 1993; Efron and
Tibshirani, 1993]. It is known that this estimate is itself
biased towards zero; a standard heuristic is to divide the
estimate by 0.632 [Efron, 1983]. A disadvantage of the
bootstrap method is that, because it requires repeated
fitting, it is computationally expensive.
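The residual-bootstrap procedure can be sketched as follows. This is a rough illustration, not the code used in the experiments: the helper names, the fixed smoothing parameter `k`, the seed, and the toy data are assumptions, while the 20 resamples and the division by 0.632 follow the text.

```python
import math
import random

def lwr(x, xs, ys, k):
    """Univariate Gaussian-kernel LWR prediction at x."""
    h = [math.exp(-k * (x - xi) ** 2) for xi in xs]
    n = sum(h)
    mu_x = sum(hi * xi for hi, xi in zip(h, xs)) / n
    mu_y = sum(hi * yi for hi, yi in zip(h, ys)) / n
    var_x = sum(hi * (xi - mu_x) ** 2 for hi, xi in zip(h, xs)) / n
    cov_xy = sum(hi * (xi - mu_x) * (yi - mu_y)
                 for hi, xi, yi in zip(h, xs, ys)) / n
    return mu_y + cov_xy / var_x * (x - mu_x)

def bootstrap_bias(x, xs, ys, k, n_boot=20, seed=0):
    """First-order residual-bootstrap bias estimate at x, scaled by 1/0.632."""
    rng = random.Random(seed)
    fitted = [lwr(xi, xs, ys, k) for xi in xs]        # predictor's fit
    resid = [yi - fi for yi, fi in zip(ys, fitted)]   # training residuals
    original = lwr(x, xs, ys, k)
    total = 0.0
    for _ in range(n_boot):
        # synthetic targets: original predictions plus resampled residuals
        synth = [fi + rng.choice(resid) for fi in fitted]
        total += lwr(x, xs, synth, k)                 # retrain and predict
    return (total / n_boot - original) / 0.632

xs = [i / 10 for i in range(11)]
linear = [2.0 * xi + 1.0 for xi in xs]
b_linear = bootstrap_bias(0.5, xs, linear, k=20.0)
```

Each resample requires a full refit, which is the source of the computational cost noted above.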
3.2.2 Estimating bias by fitting cross-validated estimates

One may also estimate the bias of a learner by fitting
its own cross-validated residuals. We first compute
the cross-validated residuals on the training examples.
These produce estimates of the learner’s bias at each
of the training points. We can then use these residuals
as training examples for another learner (again LWR)
to produce estimates of what the cross-validated error
would be in places where we don’t have training data
(see Figure 3).⁴
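A minimal sketch of this two-stage estimate, assuming leave-one-out cross-validation, noise-free data (so the residual is the cross-validated prediction minus the observed value), and the same smoothing parameter for both passes. These choices and the helper names are assumptions made for illustration.

```python
import math

def lwr(x, xs, ys, k):
    """Univariate Gaussian-kernel LWR prediction at x."""
    h = [math.exp(-k * (x - xi) ** 2) for xi in xs]
    n = sum(h)
    mu_x = sum(hi * xi for hi, xi in zip(h, xs)) / n
    mu_y = sum(hi * yi for hi, yi in zip(h, ys)) / n
    var_x = sum(hi * (xi - mu_x) ** 2 for hi, xi in zip(h, xs)) / n
    cov_xy = sum(hi * (xi - mu_x) * (yi - mu_y)
                 for hi, xi, yi in zip(h, xs, ys)) / n
    return mu_y + cov_xy / var_x * (x - mu_x)

def cv_bias(x, xs, ys, k):
    """Bias estimate at x from an LWR fit to leave-one-out residuals."""
    resid = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]                  # hold out example i
        rest_y = ys[:i] + ys[i + 1:]
        resid.append(lwr(xs[i], rest_x, rest_y, k) - ys[i])
    # second pass: interpolate the residuals over the whole domain
    return lwr(x, xs, resid, k)

xs = [i / 10 for i in range(11)]
curved = [xi ** 2 for xi in xs]          # convex toy target, an assumption
estimate = cv_bias(0.5, xs, curved, k=50.0)
```

For a convex target a local linear fit overshoots in the interior, so the estimated bias at the center of this toy domain comes out positive.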


Figure 3: Applying the learner to its own cross-validated
residuals produces an estimate of the bias that may be
evaluated over the entire domain. (Legend: training examples,
true function, estimator. Panel titles: “How do we estimate
the bias at input x?”, “Compute cross-validated residuals”,
“Use estimator (here, LWR) to fit residuals”.)

4  Empirical  results 

In the previous two sections, I have explained how having
an estimate of $\Delta\hat{y}$ and bias for a learner allows one to
compute the learner’s change in bias given a new query,
and have shown how these estimates may be computed
for a learner that uses locally weighted regression. Here,
I apply these results to several simple problems using the
“Arm2D” domain (Figure 4) and demonstrate that they
may actually be used to select queries that minimize the
statistical bias (and the error) of the learner.


⁴ One subtlety that needs to be addressed is which residual
is actually fit. Denote the cross-validated estimate as $\hat{y}_{cv}$. If
we believe the data is noise-free, then the true value of the
function at $x$ is $y$, so the cross-validated bias is $\hat{y}_{cv} - y$. If,
however, there is noise, we should assume that some of that
misfit is due to noise. In this case, the proper bias estimate
should be $\hat{y}_{cv} - \hat{y}$, with the remaining difference $\hat{y} - y$ being
due to noise.


Figure 4: (left) Arm2D - The system learns arm kinematics
by specifying joint angles $(\theta_1, \theta_2)$ and observing
tip coordinates $(x_1, x_2)$. The goal is to minimize the
MSE of the learner’s model over the input distribution
$(\theta_1, \theta_2) = (U[0, 2\pi], U[0, \pi])$. (right) A sample exploration
trajectory in joint-space for the constrained arm
problem, exploring according to the cross-validation-based
bias minimizing criterion. (Axis label: theta 1.)


4.1  Bias  estimates 

I tested the accuracy of the three bias estimators by
observing their correlations on 64 reference inputs, given
100 random training examples from the Arm2D domain.
When corrected with the 0.632 heuristic described above,
both the bootstrap and cross-validation methods produce
fairly accurate, albeit noisy, bias estimates (Figure 5).
The quadratic method produced poor correlation
and was dropped from the study.
Figure 5: Correlations between estimated and actual biases
for different estimators.


4.2  Bias  minimization 

I ran two series of experiments using the bias-minimizing
criterion in conjunction with the bias estimation technique
of the previous section on the “Arm2D” domain.
The bias minimization criterion was used as follows: At
each time step, the learner was given a set of 64 randomly
chosen candidate queries and 64 uniformly chosen
reference points. It evaluated E'(x) for each reference
point given each candidate point and selected for
its next query the candidate point with the smallest average
E'(x) over the reference points. I compared the
bias-minimizing strategy (using the cross-validation and
bootstrap estimation techniques) against random sampling
and the variance-minimizing strategy discussed in
Cohn et al. [1995]. On a Sparc 10, with m training examples,
the average evaluation times per candidate per
reference point were 58 + 0.16m μseconds for the variance
criterion, 65 + 0.53m μseconds for the cross-validation-based
bias criterion, and 83 + 3.7m μseconds for the
bootstrap-based bias criterion (with 20x resampling).

To test whether the bias-only assumption was robust
against the presence of noise, 1% Gaussian noise was
added to the input values of the training data in all
experiments. This simulates noisy position effectors on the
arm, and results in non-Gaussian noise in the output
coordinate system.
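To make the noise model concrete, the sketch below pushes Gaussian joint-angle noise through a planar two-link arm. The unit link lengths, the function names, and the reading of “1%” as a standard deviation of 1% of each joint’s range are all assumptions made for illustration; the paper does not specify them.

```python
import math
import random

def arm_tip(theta1, theta2, l1=1.0, l2=1.0):
    """Forward kinematics of a planar two-link arm (link lengths assumed)."""
    x1 = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    x2 = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x1, x2

def noisy_observation(theta1, theta2, rng, noise_frac=0.01):
    """Tip position observed when the commanded angles are perturbed.

    Assumption: noise standard deviation is noise_frac times each joint's
    range (2*pi for theta1, pi for theta2).
    """
    t1 = theta1 + rng.gauss(0.0, noise_frac * 2.0 * math.pi)
    t2 = theta2 + rng.gauss(0.0, noise_frac * math.pi)
    return arm_tip(t1, t2)

rng = random.Random(0)
samples = [noisy_observation(1.0, 0.5, rng) for _ in range(5)]
```

Because the angles enter through sines and cosines, Gaussian perturbations of the inputs are distorted nonlinearly in tip coordinates, which is why the output noise is not Gaussian.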

In the first series of experiments, the candidate points
were drawn uniformly over $(U[0, 2\pi], U[0, \pi])$. In unconstrained
domains like this, random sampling is a fairly
good default strategy. The bias minimization strategies
still significantly outperform both random sampling and
the variance minimizing strategy in these experiments
(see Figure 6).


Figure 6: MSE as a function of number of noisy training
examples for the unconstrained arm problem. The
cross-validation and bootstrap bias-minimization strategies
give a factor of 3 improvement over random selection,
and a slight improvement over variance-only minimization.
Errors are averaged over 10 runs for the bootstrap
method and 15 runs for all others. One run with
the cross-validation-based method was excluded when k
failed to converge to a reasonable value.

In the second series of experiments, candidates were
drawn uniformly from a region local to the previously
selected query: $(\theta_1 \pm 0.2\pi, \theta_2 \pm 0.1\pi)$. This corresponds
to restricting the arm to local motions. In a constrained
problem such as this, random sampling is a poor strategy;
both the bias and variance-reducing strategies outperform
it by at least an order of magnitude. Further, the
bias-minimization strategy outperforms variance minimization
by a large margin (Figure 7). Figure 4 shows
an exploration trajectory produced by pursuing the
bias-minimizing criterion. It is noteworthy that, although
the implementation in this case was a greedy (one-step)
minimization, the trajectory results in globally good
exploration.

Figure 7: MSE as a function of number of noisy training
examples for the constrained arm problem. Bias-minimization
significantly outperforms the variance-minimizing
algorithm and random exploration.

5 Discussion

I have argued in this paper that, in many situations,
selecting queries to minimize learner bias is an appropriate
and effective strategy for active learning. I have given
empirical evidence that, with a LWR-based learner and
the examples considered here, the strategy is effective
even in the presence of noise.

Beyond minimizing either bias or variance, an important
next step is to explicitly minimize them together.
The bootstrap-based estimate should facilitate this, as it
produces a complementary variance estimate with little
additional computation. By optimizing over both criteria
simultaneously, we expect to derive a criterion
that, in terms of statistics, is truly optimal for selecting
queries.